Colab#4
Draft
ironmanizawesome wants to merge 15 commits into
PCBClassNet.build() was passing the (model, learning_layer1, learning_layer2) tuple straight into get_classification, which expects a single Keras Model. Unpack so the classification head receives the encoder model as intended, making the classification path actually buildable. Also adds CLAUDE.md (project guidance) and ignores .claude/ working state plus training log files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
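A minimal sketch of the unpacking fix described above. Only the tuple shape `(model, learning_layer1, learning_layer2)` comes from the commit message; `FakeModel`, `build_encoder`, and the stubbed `get_classification` are hypothetical stand-ins so the example runs without TensorFlow, not the repository's actual code.

```python
class FakeModel:
    """Hypothetical stand-in for a Keras Model, so the sketch runs without TF."""

    def __init__(self, name):
        self.name = name


def build_encoder():
    # Assumed return shape, per the commit message:
    # (model, learning_layer1, learning_layer2).
    return FakeModel("encoder"), "learning_layer1", "learning_layer2"


def get_classification(encoder):
    # The classification head expects a single model, not a tuple.
    if not isinstance(encoder, FakeModel):
        raise TypeError(f"expected a model, got {type(encoder).__name__}")
    return FakeModel(f"{encoder.name}+classification_head")


def build():
    # Unpack first, then hand only the encoder model to the head.
    model, _layer1, _layer2 = build_encoder()
    return get_classification(model)
```

Passing `build_encoder()` straight into `get_classification` raises `TypeError` in this sketch, which mirrors the failure mode the commit describes.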
Adds notebooks/colab_train.ipynb covering the full pipeline (clone, TF 2.10 pin, Drive mount, data unzip, seg + class training with checkpoint backup to Drive) so an 8 GB local GPU isn't a blocker. Pins TF 2.10.1 + keras 2.10 + protobuf 3.19.6 in the install cell — Colab's bundled TF (2.15 with Keras 3) breaks `tf.keras.activations.softmax` calls and a few other patterns this codebase relies on. notebooks/README.md captures the data zip layout, why TF 2.10, and a VRAM cheat sheet for the common Colab GPUs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Colab's default Python is 3.12, which has no TF 2.10 wheels available (`pip install tensorflow==2.10.1` fails with "No matching distribution"). Insert a condacolab.install() step that swaps the kernel to a Python 3.10 base, then install the verified TF 2.10 stack on top. The kernel auto-restarts after condacolab.install(); the cloned repo on /content survives the restart so subsequent cells just resume. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restructures the notebook so the entire data prep pipeline runs in Colab from the raw FPIC archive (~7 GB) instead of requiring a pre-zipped processed dataset (~18 GB):
- §4 unzips data_raw.zip (pcb_image + smd_annotation)
- §5 runs create_mask.py (GPU-accelerated EDSR upscaling)
- §6 runs create_patches.py (768 px patches + 80/20 train/val split)
- §§7-10 unchanged training/eval flow with section numbers shifted

Caps full training at 40 epochs for both segmentation and classification. Colab Pro caps a single session at 24 h with a 90-min idle limit and no background execution; Seg 100 + Class 100 (~30-37 h) cannot fit. Seg 40 + Class 40 fits comfortably in roughly 12 h on a T4.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The latest condacolab defaults to Python 3.11, which TF 2.10 also has no wheels for (only 3.7–3.10). Pass python_version="3.10" so the kernel restart lands on a Python 3.10 base that the TF 2.10 install can match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the condacolab Python 3.10 dance. Colab's default Python keeps
moving past TF 2.10's wheel matrix (now 3.11/3.12), and the latest
condacolab doesn't accept python_version on install_miniforge. TF 2.15
is the last TF release on Keras 2 (Keras 3 starts at TF 2.16) and
ships wheels for the Python versions Colab actually serves, so the
codebase's tf.keras.backend.{dot,transpose} usage keeps working with
no source changes.
Also rewrites the notebook from scratch to clean up duplicate cells
that crept in during incremental NotebookEdit changes (two ## 6 / ## 7
sections, both 100- and 40-epoch training cells, missing sanity cells).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plain `pip install tensorflow==2.15.0` on Colab falls back to CPU because Colab's bundled CUDA libs are pinned to whatever TF version Colab ships, not 2.15. The `[and-cuda]` extra pulls in matching nvidia-cudnn-cu12 / cublas-cu12 / etc. wheels alongside TF, which is what TF's GPU loader actually expects to dlopen. Without this, training falls back to CPU and create_mask.py / train_*.py take ~10× longer with periodic "Cannot dlopen some GPU libraries" warnings in stderr. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oken)
tensorflow[and-cuda]==2.15.0 fails to resolve because the extra pins tensorrt-libs==8.6.1, which has been removed from PyPI (only 9.x is still available). Drop the bracket extra and install nvidia-cudnn-cu12, nvidia-cublas-cu12, etc. by name in a separate pip call. TF needs them at dlopen time but doesn't actually use TensorRT for training.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Colab's notebook kernel runs on Python 3.12, but TF 2.15 only ships wheels for Python 3.9–3.11. Colab images already include /usr/local/bin/python3.11; install the TF 2.15 stack into that interpreter and run create_mask.py / create_patches.py / train_*.py via !python3.11 instead of !python. The notebook kernel itself stays on Python 3.12 — we never import tensorflow from kernel cells, just shell-out to python3.11 for everything that touches TF. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
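The interpreter split can be sketched as below. The wheel ranges are the ones stated in these commit messages (TF 2.10: Python 3.7 to 3.10; TF 2.15: Python 3.9 to 3.11); the helper names `has_wheel` and `tf_command` are hypothetical, and the script arguments are left to the caller because the commits do not list them.

```python
import shutil

# Supported Python 3 minor versions per TF series, as stated in the commit messages.
TF_WHEEL_RANGE = {"2.10": (7, 10), "2.15": (9, 11)}


def has_wheel(tf_series, py_minor):
    """Return True if the TF series ships a wheel for Python 3.<py_minor>."""
    low, high = TF_WHEEL_RANGE[tf_series]
    return low <= py_minor <= high


def tf_command(script, *args, interpreter="python3.11"):
    """Build the argv for running a TF script under a separate interpreter.

    Falls back to the bare interpreter name when it is not found on PATH,
    so the caller sees a clear "command not found" instead of a None entry.
    """
    path = shutil.which(interpreter) or interpreter
    return [path, script, *args]
```

For example, `has_wheel("2.15", 12)` is false, which is why the kernel's Python 3.12 cannot import TF 2.15 directly, and `subprocess.run(tf_command("create_mask.py"), check=True)` would be one way to shell out from a notebook cell.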
…cell
DISLoss's SSIM gradient backward path spikes a 416 MB tensor ([batch=16, 26 classes, 512, 512]) that fragments the allocator on T4 16 GB GPUs and OOMs even though plenty of free memory exists. TF itself recommends `cuda_malloc_async` in this case. Add it as a prefix to every train/eval invocation so the recommendation actually takes effect; on L4 24 GB it's redundant but harmless.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
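The 416 MB figure follows from the tensor shape if the elements are float32 (4 bytes each), which is an assumption here since the commit does not state the dtype. The sketch below checks the arithmetic and shows one way to apply the allocator setting through the `TF_GPU_ALLOCATOR` environment variable; `tensor_mib` and `train_env` are hypothetical helper names.

```python
import os


def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB; float32 (4 bytes) is assumed by default."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bytes_per_element / 2**20


def train_env(base=None):
    """Copy an environment mapping and select TF's async CUDA allocator."""
    env = dict(os.environ if base is None else base)
    env["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    return env


# 16 * 26 * 512 * 512 elements * 4 bytes = 416 MiB
print(tensor_mib((16, 26, 512, 512)))
```

Passing `env=train_env()` to `subprocess.run` has the same effect as prefixing the shell command with `TF_GPU_ALLOCATOR=cuda_malloc_async`.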
L4 Colab images don't ship python3.11 (T4 ones do). Add a guard that installs python3.11 from deadsnakes PPA when it's missing.

Pin every nvidia-*-cu12 wheel to the version TF 2.15 expects to dlopen:
- nvidia-cudnn-cu12==8.9.4.25 (latest is 9.x; TF 2.15 needs libcudnn.so.8)
- nvidia-cublas-cu12==12.2.5.6 etc.

Without these pins TF 2.15 falls back to CPU on a fresh runtime because it can't find the right .so versions, and the warnings are easy to miss.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
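Because the CPU fallback is easy to miss, a quick check of the installed wheel versions against the pins may help. This is a sketch only: it lists just the two pins named in the commit message (the rest are elided there), `pin_mismatches` is a hypothetical helper, and it would need to run under the same interpreter the wheels were installed into.

```python
from importlib.metadata import PackageNotFoundError, version

# Only the two pins spelled out in the commit message; the others are elided there.
PINS = {
    "nvidia-cudnn-cu12": "8.9.4.25",
    "nvidia-cublas-cu12": "12.2.5.6",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def pin_mismatches(pins=PINS, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) pair; an empty dict means all pins match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

An empty result from `pin_mismatches()` only confirms the wheel versions; it does not prove TF can dlopen the libraries, so checking the GPU list afterwards is still worthwhile.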
40 was too aggressive a cut from the paper's 100. 80 is the sweet spot: enough room for ReduceLROnPlateau (patience=15) to fire and fine-tune, while still fitting inside Colab Pro's 24 h session limit (~9h per model on L4 = 18h total + preprocessing buffer). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
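The session budget above works out as follows. The numbers are the ones quoted in the commit message (roughly 9 h per model on an L4, a 24 h session cap); the function names are hypothetical.

```python
SESSION_LIMIT_H = 24  # Colab Pro single-session cap, per the commit messages
HOURS_PER_MODEL = 9   # approx. hours per 80-epoch model on an L4, per the commit message


def training_hours(models=2, hours_per_model=HOURS_PER_MODEL):
    """Total training time for the segmentation and classification models."""
    return models * hours_per_model


def buffer_hours(limit=SESSION_LIMIT_H):
    """Time left in the session for preprocessing and overhead."""
    return limit - training_hours()


print(training_hours(), buffer_hours())  # 18 hours of training, 6 hours of buffer
```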
The first 80-epoch segmentation run lands val_dice around 0.71 with train dice 0.92 (clear overfit) and lr already at min_lr=1e-5. Add a second optional stage that resumes from best_seg.h5 with a lower lr range so ReduceLROnPlateau can keep stepping down past 1e-5.

Changes:
- train_segmentation.py: -resume CLI flag; when set, model.load_weights is called on the configured checkpoint path before fit().
- src/cfs/pscn_seg_finetune.yml: same architecture as pscn_seg.yml but lr=1e-5 (where the first run left off) and min_lr=1e-6.
- notebooks/colab_train.ipynb: new §8b that restores best_seg.h5 from Drive if missing, runs 20 epochs with -resume + the finetune config, then re-mirrors the best checkpoint.
- .gitignore: ignore /best_*.h5 and root-level *.zip (Colab artifacts that landed in the working tree).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
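A minimal sketch of how a `-resume` flag like the one described could be wired up. Only the flag name and the "load weights before fit()" behavior come from the commit message; `parse_args` and `maybe_resume` are hypothetical helpers, and the real train_segmentation.py may be structured differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the CLI; argparse accepts single-dash long flags such as -resume."""
    parser = argparse.ArgumentParser(description="segmentation training (sketch)")
    parser.add_argument(
        "-resume",
        action="store_true",
        help="load weights from the configured checkpoint before training",
    )
    return parser.parse_args(argv)


def maybe_resume(model, checkpoint_path, resume):
    """Call model.load_weights(checkpoint_path) when resuming.

    `model` is anything with a Keras-style load_weights method.
    Returns True if weights were loaded.
    """
    if resume:
        model.load_weights(checkpoint_path)
        return True
    return False
```

In a training script this would sit just before `model.fit(...)`, for example `maybe_resume(model, "best_seg.h5", parse_args().resume)`, where the checkpoint path is whatever the config specifies.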
Brings the fine-tune second stage into the main Colab branch so users running the notebook always have §8b available without switching branches.
TF 2.15 lazy-loads tf.keras, and accessing __version__ on it raises AttributeError mid-cell, swallowing the GPU print that follows. Print TF version + GPU list only; users who specifically need the keras version can run it in a separate cell. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
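For readers who still want the Keras version, one option is to look it up defensively so a missing attribute cannot abort the rest of the output. This is a sketch of an alternative, not what the commit does (the commit simply drops the Keras print); `runtime_summary` and `keras_version` are hypothetical names, and the module is passed in as a parameter so the functions can be exercised without TensorFlow installed.

```python
def runtime_summary(tf_module):
    """Return the TF version and GPU list as printable lines."""
    gpus = tf_module.config.list_physical_devices("GPU")
    return [f"TF {tf_module.__version__}", f"GPUs: {gpus}"]


def keras_version(tf_module):
    """Return tf.keras.__version__, or None if the attribute is unavailable."""
    try:
        return tf_module.keras.__version__
    except AttributeError:
        return None
```

Under an interpreter that has TF installed, `import tensorflow as tf` followed by `print("\n".join(runtime_summary(tf)))` prints the two lines the commit keeps, and `keras_version(tf)` can be called separately without risking the GPU print.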
No description provided.